On the Use of Non-Stationary Policies for Infinite-Horizon Discounted Markov Decision Processes
Abstract
We consider infinite-horizon γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. We consider the algorithm Value Iteration and the sequence of policies π_1, ..., π_k it implicitly generates until some iteration k. We provide performance bounds for non-stationary policies involving the last m generated policies that reduce the state-of-the-art bound for the last stationary policy π_k by a factor $\frac{1-\gamma}{1-\gamma^m}$. In particular, the use of non-stationary policies allows one to reduce the usual asymptotic performance bound of Value Iteration with errors bounded by ε at each iteration from $\frac{\gamma}{(1-\gamma)^2}\epsilon$ to $\frac{\gamma}{1-\gamma}\epsilon$, which is significant in the usual situation where γ is close to 1. Given Bellman operators that can only be computed with some error ε, a surprising consequence of this result is that the problem of "computing an approximately optimal non-stationary policy" is much simpler than that of "computing an approximately optimal stationary policy", and even slightly simpler than that of "approximately computing the value of some fixed policy", since this last problem only comes with a guarantee of $\frac{1}{1-\gamma}\epsilon$.

Given a Markov Decision Process, suppose one runs an approximate version of Value Iteration, that is, one builds a sequence of value-policy pairs as follows:

$$\text{pick any } \pi_{k+1} \in \mathcal{G}v_k, \qquad v_{k+1} = T_{\pi_{k+1}} v_k + \epsilon_{k+1},$$

where v_0 is arbitrary, 𝒢v_k is the set of policies that are greedy¹ with respect to v_k, and T_{π_{k+1}} is the linear Bellman operator associated with policy π_{k+1}. Though it does not appear exactly in this form in the literature, the following performance bound is somewhat standard.

Theorem 1. Let ε = max_{1≤j<k} ‖ε_j‖_sp be a uniform upper bound on the span seminorm² of the errors before iteration k. The loss of policy π_k is bounded as follows:

$$\|v_* - v_{\pi_k}\|_\infty \;\le\; \frac{1}{1-\gamma}\left(\frac{\gamma-\gamma^k}{1-\gamma}\,\epsilon + \gamma^k\,\|v_* - v_0\|_{\mathrm{sp}}\right). \tag{1}$$

In Theorem 2, we will prove a generalization of this result, so we do not provide a proof here. Since for any f, ‖f‖_sp ≤ 2‖f‖_∞, Theorem 1 constitutes a slight improvement and a (finite-iteration) generalization of the following well-known performance bound (see [1]):

$$\limsup_{k\to\infty} \|v_* - v_{\pi_k}\|_\infty \;\le\; \frac{2\gamma}{(1-\gamma)^2}\,\max_k \|\epsilon_k\|_\infty.$$

¹ There may be several greedy policies with respect to some value v, and what we write here holds whichever one is picked.
² For any function f defined on the state space, the span seminorm of f is ‖f‖_sp = max_s f(s) − min_s f(s). The motivation for using the span seminorm instead of the more usual L∞-norm is twofold: 1) it slightly improves on the state-of-the-art bounds, and 2) it simplifies the construction of an example in the proof of the forthcoming Proposition 1.
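To make the construction above concrete, here is a small numerical sketch in Python. It is not from the paper: the random MDP, the sizes S and A, the error level eps, and the choice m = 5 are hypothetical illustration parameters. The sketch runs approximate Value Iteration with a uniformly bounded error term, keeps the generated policies π_1, ..., π_k, and compares the loss of the last stationary policy π_k with that of the non-stationary policy that cyclically plays the last m generated policies, which is the object the bounds discussed above refer to.

```python
# A minimal sketch (not from the paper): approximate Value Iteration with a
# bounded error term on a small random MDP, comparing the loss of the last
# stationary policy pi_k with that of the non-stationary policy looping over
# the last m generated policies. All sizes and the error level are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, eps, m = 20, 3, 0.95, 0.05, 5

# Random MDP: P[a, s, s'] is a transition kernel, r[s, a] are rewards in [0, 1].
P = rng.dirichlet(np.ones(S), size=(A, S))
r = rng.random((S, A))

def greedy(v):
    """Return a policy that is greedy with respect to v (any maximiser works)."""
    q = r + gamma * np.einsum("ast,t->sa", P, v)
    return q.argmax(axis=1)

def T(v, pi):
    """Linear Bellman operator T_pi applied to v."""
    return r[np.arange(S), pi] + gamma * P[pi, np.arange(S)] @ v

# Approximate Value Iteration: v_{k+1} = T_{pi_{k+1}} v_k + eps_{k+1},
# with each error term drawn uniformly in [-eps, eps].
v, policies = np.zeros(S), []
for k in range(200):
    pi = greedy(v)
    policies.append(pi)
    v = T(v, pi) + eps * (2.0 * rng.random(S) - 1.0)

def value_of_loop(pis):
    """Value of the non-stationary policy that plays pis[-1] first, then
    pis[-2], ..., then pis[0], and repeats: computed by iterating the composed
    Bellman operators to (numerical) convergence."""
    u = np.zeros(S)
    for _ in range(2000):
        for pi in pis:          # innermost operator (pis[0]) is applied first
            u = T(u, pi)
    return u

# Exact optimal value for reference (error-free Value Iteration run to convergence).
v_star = np.zeros(S)
for _ in range(2000):
    v_star = (r + gamma * np.einsum("ast,t->sa", P, v_star)).max(axis=1)

loss_stationary = np.max(v_star - value_of_loop(policies[-1:]))
loss_nonstationary = np.max(v_star - value_of_loop(policies[-m:]))
print(f"loss of stationary pi_k           : {loss_stationary:.4f}")
print(f"loss of non-stationary (last {m})  : {loss_nonstationary:.4f}")
```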
Similar articles
On the Use of Non-Stationary Policies for Stationary Infinite-Horizon Markov Decision Processes
We consider infinite-horizon stationary γ-discounted Markov Decision Processes, for which it is known that there exists a stationary optimal policy. Using Value and Policy Iteration with some error ε at each iteration, it is well known that one can compute stationary policies that are $\frac{2\gamma}{(1-\gamma)^2}\epsilon$-optimal. After arguing that this guarantee is tight, we develop variations of Value and Policy Iter...
Markov decision processes with fuzzy rewards
In this paper, we consider the model in which the information on the rewards in vector-valued Markov decision processes includes imprecision or ambiguity. The fuzzy reward model is analyzed as follows: the fuzzy reward is represented by a fuzzy set on the multi-dimensional Euclidean space R, and the infinite-horizon fuzzy expected discounted reward (FEDR) from any stationary policy is characterized...
A fuzzy approach to Markov decision processes with uncertain transition probabilities
In this paper, a Markov decision model with uncertain transition matrices, which allow a matrix to fluctuate at each step in time, is described by the use of fuzzy sets. We find a Pareto optimal policy maximizing the infinite-horizon fuzzy expected discounted reward over all stationary policies under some partial order. The Pareto optimal policies are characterized by maximal solutions of an op...
Risk-sensitive and minimax control of discrete-time, finite-state Markov decision processes
This paper analyzes a connection between risk-sensitive and minimax criteria for discrete-time, finite-state Markov Decision Processes (MDPs). We synthesize optimal policies with respect to both criteria, for both finite-horizon and discounted infinite-horizon problems. A generalized decision-making framework is introduced, which includes as special cases a number of approaches that have been consid...
Non-randomized policies for constrained Markov decision processes
This paper addresses constrained Markov decision processes, with expected discounted total cost criteria, which are controlled by nonrandomized policies. A dynamic programming approach is used to construct optimal policies. The convergence of the series of finite horizon value functions to the infinite horizon value function is also shown. A simple example illustrating an application is presented.
On the Convergence of Optimal Actions for Markov Decision Processes and the Optimality of (s, S) Inventory Policies
This paper studies convergence properties of optimal values and actions for discounted and average-cost Markov Decision Processes (MDPs) with weakly continuous transition probabilities and applies these properties to the stochastic periodic-review inventory control problem with backorders, positive setup costs, and convex holding/backordering costs. The following results are established for MDPs...
Journal: CoRR
Volume: abs/1203.5532
Year: 2012